[HUDI-1013] Adding Bulk Insert V2 implementation#1834
[HUDI-1013] Adding Bulk Insert V2 implementation#1834vinothchandar merged 5 commits intoapache:masterfrom
Conversation
d03df7b to
8366e83
Compare
a5f608c to
e5d4939
Compare
vinothchandar
left a comment
There was a problem hiding this comment.
@nsivabalan I made a high level pass. Mostly lgtm. Will make some changes, test and land.
hudi-client/src/main/java/org/apache/hudi/client/HoodieInternalWriteStatus.java
Outdated
Show resolved
Hide resolved
hudi-client/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java
Outdated
Show resolved
Hide resolved
af27762 to
1a11d1c
Compare
c957008 to
dacd635
Compare
nsivabalan
left a comment
There was a problem hiding this comment.
Leaving some notes for reviewer :)
hudi-client/src/main/java/org/apache/hudi/client/HoodieInternalWriteStatus.java
Outdated
Show resolved
Hide resolved
hudi-client/src/main/java/org/apache/hudi/io/HoodieRowCreateHandle.java
Outdated
Show resolved
Hide resolved
hudi-client/src/main/java/org/apache/hudi/io/HoodieRowCreateHandle.java
Outdated
Show resolved
Hide resolved
hudi-client/src/main/java/org/apache/hudi/keygen/BuiltinKeyGenerator.java
Outdated
Show resolved
Hide resolved
hudi-client/src/main/java/org/apache/hudi/keygen/KeyGenerator.java
Outdated
Show resolved
Hide resolved
hudi-spark/src/main/java/org/apache/hudi/keygen/TimestampBasedKeyGenerator.java
Outdated
Show resolved
Hide resolved
hudi-spark/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala
Outdated
Show resolved
Hide resolved
hudi-spark/src/test/java/org/apache/hudi/internal/TestHoodieDataSourceInternalWriter.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Note to reviewer: I am yet to add tests to these new methods. Got these as part of rebase. Also, I notice few other test classes for each key generators after rebasing. Will add tests by tmrw to those new test classes.
hudi-spark/src/test/scala/org/apache/hudi/TestDataSourceDefaults.scala
Outdated
Show resolved
Hide resolved
vinothchandar
left a comment
There was a problem hiding this comment.
Took a high level pass. Great work @bvaradar , @nsivabalan . Did some testing as well.
hudi-client/src/main/java/org/apache/hudi/client/HoodieInternalWriteStatus.java
Outdated
Show resolved
Hide resolved
hudi-spark/src/main/java/org/apache/hudi/keygen/GlobalDeleteKeyGenerator.java
Outdated
Show resolved
Hide resolved
hudi-spark/src/main/scala/org/apache/hudi/DataSourceOptions.scala
Outdated
Show resolved
Hide resolved
hudi-spark/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala
Outdated
Show resolved
Hide resolved
hudi-spark/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala
Outdated
Show resolved
Hide resolved
dacd635 to
06e9693
Compare
06e9693 to
d491a32
Compare
|
#1834 (comment) |
hudi-spark/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala
Outdated
Show resolved
Hide resolved
nsivabalan
left a comment
There was a problem hiding this comment.
one nit on HoodieSparkSqlWriter.
866ee72 to
5e77ae3
Compare
vinothchandar
left a comment
There was a problem hiding this comment.
@bvaradar @nsivabalan this is a more detailed review with notes to self. Please take a look and chime in. I think we need to be careful about how we future proof the KeyGenerator API.
hudi-spark/src/main/java/org/apache/hudi/keygen/BuiltinKeyGenerator.java
Outdated
Show resolved
Hide resolved
hudi-spark/src/main/java/org/apache/hudi/keygen/BuiltinKeyGenerator.java
Outdated
Show resolved
Hide resolved
hudi-client/src/main/java/org/apache/hudi/client/AbstractHoodieWriteClient.java
Outdated
Show resolved
Hide resolved
hudi-client/src/main/java/org/apache/hudi/client/model/HoodieInternalRow.java
Show resolved
Hide resolved
hudi-client/src/main/java/org/apache/hudi/client/model/HoodieInternalRow.java
Outdated
Show resolved
Hide resolved
hudi-spark/src/main/java/org/apache/hudi/keygen/TimestampBasedKeyGenerator.java
Outdated
Show resolved
Hide resolved
| @Override | ||
| public String getPartitionPath(Row row) { | ||
| Object fieldVal = null; | ||
| Object partitionPathFieldVal = RowKeyGeneratorHelper.getNestedFieldVal(row, getPartitionPathPositions().get(getPartitionPathFields().get(0))); |
This comment was marked as resolved.
This comment was marked as resolved.
Sorry, something went wrong.
This comment was marked as resolved.
This comment was marked as resolved.
Sorry, something went wrong.
hudi-spark/src/test/scala/org/apache/hudi/TestDataSourceDefaults.scala
Outdated
Show resolved
Hide resolved
5e77ae3 to
95a71fe
Compare
nsivabalan
left a comment
There was a problem hiding this comment.
some notes to reviewer.
hudi-spark/src/main/java/org/apache/hudi/keygen/GlobalDeleteKeyGenerator.java
Outdated
Show resolved
Hide resolved
hudi-spark/src/main/java/org/apache/hudi/keygen/TimestampBasedKeyGenerator.java
Outdated
Show resolved
Hide resolved
nsivabalan
left a comment
There was a problem hiding this comment.
some more comments.
hudi-spark/src/main/java/org/apache/hudi/keygen/BuiltinKeyGenerator.java
Outdated
Show resolved
Hide resolved
hudi-spark/src/main/java/org/apache/hudi/keygen/BuiltinKeyGenerator.java
Outdated
Show resolved
Hide resolved
hudi-spark/src/main/java/org/apache/hudi/keygen/BuiltinKeyGenerator.java
Outdated
Show resolved
Hide resolved
hudi-spark/src/main/java/org/apache/hudi/keygen/BuiltinKeyGenerator.java
Outdated
Show resolved
Hide resolved
hudi-spark/src/main/java/org/apache/hudi/keygen/TimestampBasedKeyGenerator.java
Outdated
Show resolved
Hide resolved
- Clean up KeyGenerator classes and fix test failures
95a71fe to
06c1370
Compare
hudi-client/src/main/java/org/apache/hudi/client/model/HoodieInternalRow.java
Show resolved
Hide resolved
hudi-client/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java
Outdated
Show resolved
Hide resolved
| public HoodieRowParquetWriteSupport(Configuration conf, StructType structType, BloomFilter bloomFilter) { | ||
| super(); | ||
| Configuration hadoopConf = new Configuration(conf); | ||
| hadoopConf.set("spark.sql.parquet.writeLegacyFormat", "false"); |
There was a problem hiding this comment.
yes. why we are hardcoding this. any ideas @bvaradar ?
hudi-common/src/main/java/org/apache/hudi/common/model/WriteOperationType.java
Outdated
Show resolved
Hide resolved
hudi-spark/src/main/scala/org/apache/hudi/DataSourceOptions.scala
Outdated
Show resolved
Hide resolved
hudi-spark/src/main/scala/org/apache/hudi/HoodieWriterUtils.scala
Outdated
Show resolved
Hide resolved
c8745f8 to
5dc8182
Compare
- Introduced KeyGeneratorInterface in hudi-client, moved KeyGenerator back to hudi-spark - Simplified the new API additions to just two new methods : getRecordKey(row), getPartitionPath(row) - Fixed all built-in key generators with new APIs - Made the field position map lazily created upon the first call to row based apis - Implemented native row based key generators for CustomKeyGenerator - Fixed all the tests, with these new APIs
5dc8182 to
865d8d6
Compare
|
@nsivabalan this is ready. I am going ahead and merging. I also re-ran the benchmark again . Seems to clock the same 30 mins against spark.write.parquet. Please carefully go over the changes I have made in the last commits here.. and see if anything needs follow on fixing. Our timelines are tight. we need to do it tomorrow, if at all |
| .forEach(f -> partitionPathPositions.put(f, | ||
| RowKeyGeneratorHelper.getNestedFieldIndices(structType, f, false))); | ||
| } | ||
| this.structType = structType; |
There was a problem hiding this comment.
may I know where is the structType being used ? AvroConversionHelper.createConverterToAvro used row.Schema() and so we may not need it. probably we should rename this to boolean positionMapInitialized.
nsivabalan
left a comment
There was a problem hiding this comment.
few minor comments
| // insert/bulk-insert combining to be true, if filtering for duplicates | ||
| boolean combineInserts = Boolean.parseBoolean(parameters.get(DataSourceWriteOptions.INSERT_DROP_DUPS_OPT_KEY())); | ||
| HoodieWriteConfig.Builder builder = HoodieWriteConfig.newBuilder() | ||
| .withPath(basePath).withAutoCommit(false).combineInput(combineInserts, true); |
There was a problem hiding this comment.
Hi, @nsivabalan Do you got any idea why we disable AutoCommit by default when creating HoodieWriteConfig?
What is the purpose of the pull request
Brief change log
Verify this pull request
This change added tests and can be verified as follows:
Committer checklist
Has a corresponding JIRA in PR title & commit
Commit message is descriptive of the change
CI is green
Necessary doc changes done or have another open PR
For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.